Machine Translation in Microblogs

Author

  • Wang Ling
Abstract

The emergence of social media has drastically changed the way information is published. Because people with different backgrounds can now publish information, non-standard style, formality, content, genre and topics are present in documents that are published or texted. One example is posts in microblogs and social networks, such as Twitter, Facebook and Sina Weibo. The people who publish these documents are not professionals, yet the information published can be leveraged for many ends [Han and Baldwin, 2011, Hawn, 2009, Kwak et al., 2010, Sakaki et al., 2010]. However, current NLP tools, such as part-of-speech taggers, perform poorly on this type of data [Gimpel et al., 2011], since they are trained using traditional features and on existing data. One problem is the lack of annotated datasets in this domain. Another problem is that these models are built for clean and formal datasets and make assumptions that do not hold in this domain. One such assumption is spelling homogeneity: we assume that there is only one way to spell tomorrow, whereas in microblogs this word can be abbreviated to tmrw or misspelled as tomorow. In this thesis, we will address the challenge of NLP in the domain of informal online texts, with emphasis on Machine Translation. This thesis makes the following contributions in this respect. (1) We shall present an automatic method for extracting in-domain data from microblog posts. Using this in-domain corpus, large improvements can be obtained over systems trained using existing datasets. (2) We also use this corpus to build a microblog normalizer using paraphrasing, which can convert microblog messages into a more standardized text genre that is more amenable to automatic processing with traditional tools. (3) This will also contribute to better translation quality, and we shall describe methods that better leverage a microblog normalizer in the MT pipeline. (4) Afterwards, we shall discuss problems with existing evaluation metrics and present metrics that are better suited to this domain. (5) Finally, we shall show applications of the artifacts built in this work in other fields, such as Speech Processing, Natural Language Processing and Information Retrieval. We will also show how the proposed work improves MT in domains other than microblogs.
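
To make the spelling-homogeneity problem concrete, here is a minimal Python sketch of a lookup-based normalizer. It is only an illustration under stated assumptions: the table `VARIANTS` and the function `normalize` are hypothetical names, and a hand-written table stands in for the paraphrasing-based normalizer that the abstract describes, which is built from a corpus rather than written by hand.

```python
import re

# Hypothetical, hand-written variant table (an assumption for illustration only);
# the abstract describes a normalizer built via paraphrasing from a corpus instead.
VARIANTS = {
    "tmrw": "tomorrow",     # abbreviation mentioned in the abstract
    "tomorow": "tomorrow",  # misspelling mentioned in the abstract
}


def normalize(text: str) -> str:
    """Replace known non-standard spellings with a standard form, word by word."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        # Look up the lowercased word; keep the original word if it is unknown.
        return VARIANTS.get(word.lower(), word)

    # Only alphabetic runs are treated as words; punctuation and spacing are kept.
    return re.sub(r"[A-Za-z]+", fix, text)


if __name__ == "__main__":
    print(normalize("see you tmrw"))
```

A tool that assumes a single spelling of tomorrow would treat tmrw and tomorow as unknown words; mapping both to one standard form before tagging or translation is the kind of preprocessing a normalizer provides.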

Similar resources

Machine Translation 4 Microblogs

The emergence of social media caused a drastic change in the way information is published. In contrast to previous eras in which the written word was more dominated by formal registers, the possibility for people with different backgrounds to publish information has caused non-standard style, formality, content, genre and topic to be present in written documents. One source of such data are pos...

A Comparative Study of English-Persian Translation of Neural Google Translation

Many studies abroad have focused on neural machine translation and almost all concluded that this method was much closer to humanistic translation than machine translation. Therefore, this paper aimed at investigating whether neural machine translation was more acceptable in English-Persian translation in comparison with machine translation. Hence, two types of text were chosen to be translated...

Mining Parallel Corpora from Sina Weibo and Twitter

Microblogs such as Twitter, Facebook, and Sina Weibo (China’s equivalent of Twitter), are a remarkable linguistic resource. In contrast to content from edited genres such as newswire, microblogs contain discussions of virtually every topic by numerous individuals in different languages and dialects and in different styles. In this work, we show that some microblog users post “self-translated” m...

The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language

Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines as they are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages is still under question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...

A new model for persian multi-part words edition based on statistical machine translation

Multi-part words in English language are hyphenated and hyphen is used to separate different parts. Persian language consists of multi-part words as well. Based on Persian morphology, half-space character is needed to separate parts of multi-part words where in many cases people incorrectly use space character instead of half-space character. This common incorrectly use of space leads to some s...

Publication date: 2013